nmer_OriginalHomeworkCode_03
5 Challenges
- Received this error while trying to run a qqplot for years educated: Error in hist.default(a, breaks = seq(0, 9, 1), probability = TRUE, main = “Probability of Years Educated”, : some ‘x’ not counted; maybe ‘breaks’ do not span range of ‘x’. I realized before I could even google it that it was because I had not swapped out my data varialbe form the last qqplot! I switched it from a for age to ye for years educated and it ran.
- At first I was not sure exactly how to plot the histograms in question 4. I looked back in previous modules and realized it was very similar to simulating a distrubtion to rnorm then plotting those results in a histogram, but here we already had a distribution in the various quantitative variables.I used max() and min() for each variable to determine the x axis range and breaks.
- I could not get the green run code button to work on my last two chunks for my histograms/qqplots in question 4. I tried knitting it to see if I could see anything in the rendered version that showed any problems or if ran the code there. I noticed an erroneous ``` in the text immediately preceding the problematic chunks. I went back to the source and deleted it. When I did that, I was able to run the two chunks.
- Got this error when trying to do sample(d, 30, replace=FALSE): “Error in sample.int(length(x), size, replace, prob) : cannot take a sample larger than the population when ‘replace = FALSE’”. I googled it and it seems like using a dplyr function (slice_sample()) is easier than trying to make sample() work in this context. I was able to “slice” 30 rows of the original 1000 in one line of code. I also decided to use set.seed() so that the data would be the same every time it is run.
- I originally switched qnorm for qpois to do the CI but the upper and lower turned out the same values. I realized I needed to calculate the Standard Errors differently for those. However, that did not fix it. I searched online and found a lot of places saying to get an exact poisson CI, you need to use chi squared with alpha values. I found one website which had a formula for this. I used that formula but left my original method in the chunks in case I want to workshop it more. Here is where I found the formula: https://stats.stackexchange.com/questions/10926/how-to-calculate-confidence-interval-for-count-data-in-r Bonus 6th challenge: See Question 6 for graveyard of abandoned code trying to sample the data 99 times. I continued trying to use slice_sample(d) with the replicate function which replicated the data set 99 times with vectors of length 30 in each cell. I could not subset variables from it and take means, even with for loops because things were not the correct class or data type. Don’t write code late at night! I talked about it with Jimmy and we agreed the general idea was right but instead needed to go back to regular old sample() and subset the data to each variable within sample. Then use mean around sample, then replicate around the whole thing with 99. Smooth sailing after that.
file <- curl('https://raw.githubusercontent.com/fuzzyatelin/fuzzyatelin.github.io/master/AN588_Fall23/zombies.csv')
d <- read.csv(file, header= TRUE, sep = ",", stringsAsFactors = FALSE)
head(d)## id first_name last_name gender height weight zombies_killed
## 1 1 Sarah Little Female 62.88951 132.0872 2
## 2 2 Mark Duncan Male 67.80277 146.3753 5
## 3 3 Brandon Perez Male 72.12908 152.9370 1
## 4 4 Roger Coleman Male 66.78484 129.7418 5
## 5 5 Tammy Powell Female 64.71832 132.4265 4
## 6 6 Anthony Green Male 71.24326 152.5246 1
## years_of_education major age
## 1 1 medicine/nursing 17.64275
## 2 3 criminal justice administration 22.58951
## 3 1 education 21.91276
## 4 6 energy studies 18.19058
## 5 3 logistics 21.10399
## 6 4 energy studies 21.48355
## 'data.frame': 1000 obs. of 10 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ first_name : chr "Sarah" "Mark" "Brandon" "Roger" ...
## $ last_name : chr "Little" "Duncan" "Perez" "Coleman" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ height : num 62.9 67.8 72.1 66.8 64.7 ...
## $ weight : num 132 146 153 130 132 ...
## $ zombies_killed : int 2 5 1 5 4 1 0 4 9 2 ...
## $ years_of_education: int 1 3 1 6 3 4 4 0 3 3 ...
## $ major : chr "medicine/nursing" "criminal justice administration" "education" "energy studies" ...
## $ age : num 17.6 22.6 21.9 18.2 21.1 ...
Question 1
Calculate the population mean and standard deviation for each quantitative random variable (height, weight, age, number of zombies killed, and years of education). NOTE: You will not want to use the built in var() and sd() commands as these are for samples.
## [1] 67.6301
## [1] 143.9075
## [1] 20.04696
## [1] 2.992
## [1] 2.996
For population standard deviation, first will define a function for population variance, not sample variance. This is the sum of squares/n.
## [1] 18.55861
## [1] 338.2604
## [1] 8.782822
## [1] 3.053936
## [1] 2.807984
To get the population SD, I just need to define a function taking the square root of the population variance
## [1] 4.30797
## [1] 18.39186
## [1] 2.963583
## [1] 1.747551
## [1] 1.675704
Question 2
Use {ggplot} to make boxplots of each of these variables by gender.
ph <- ggplot(data = d, aes(x = gender, y = h))
ph <- ph + geom_boxplot()
ph <- ph + theme(axis.text.x = element_text(angle = 90))
ph <- ph + ylab("Heights of Pop")
phpw <- ggplot(data = d, aes(x = gender, y = w))
pw <- pw + geom_boxplot()
pw <- pw + theme(axis.text.x = element_text(angle = 90))
pw <- pw + ylab("Weights of Pop")
pwpa <- ggplot(data = d, aes(x = gender, y = a))
pa <- pa + geom_boxplot()
pa <- pa + theme(axis.text.x = element_text(angle = 90))
pa <- pa + ylab("Ages of Pop")
papzk <- ggplot(data = d, aes(x = gender, y = zk))
pzk <- pzk + geom_boxplot()
pzk <- pzk + theme(axis.text.x = element_text(angle = 90))
pzk <- pzk + ylab("Number of Zombies Killed")
pzkpye <- ggplot(data = d, aes(x = gender, y = ye))
pye <- pye + geom_boxplot()
pye <- pye + theme(axis.text.x = element_text(angle = 90))
pye <- pye + ylab("Years Educated")
pyeQuestion 3
Use {ggplot} to make scatterplots of height and weight in relation to age. Do these variables seem to be related? In what way?
ph1 <- ggplot(data = d, aes(x = a, y = h, color = factor(gender)))
ph1 <- ph1 + xlab("Age") + ylab("Height")
ph1 <- ph1 + geom_point()
ph1 <- ph1 + theme(legend.position = "bottom", legend.title = element_blank())
ph1pw1 <- ggplot(data = d, aes(x = a, y = w, color = factor(gender)))
pw1 <- pw1 + xlab("Age") + ylab("Weight")
pw1 <- pw1 + geom_point()
pw1 <- pw1 + theme(legend.position = "bottom", legend.title = element_blank())
pw1
Looks like there is a relationship between age and hieght- height is
dependent on age. As age goes up, height also goes up. However, there is
no such relationship between age and weight.
Question 4
Using histograms and Q-Q plots, check whether the quantitative variables seem to be drawn from a normal distribution. Which seem to be and which do not (hint: not all are drawn from the normal distribution)? For those that are not normal, can you determine from which common distribution they are drawn?
## [1] 54.14948
## [1] 80.5298
hist(h, breaks = seq(50,90,1), probability= TRUE, main = "Probability of Height of Pop", xlab = "Height", ylab = "probability")
qqnorm(h, main = "Normal QQ Plot for Height of Pop")
qqline(h, col = "gray")## [1] 90.29148
## [1] 210.7895
hist(w, breaks = seq(80,220,1), probability= TRUE, main = "Probability of Weight of Pop", xlab = "Weight", ylab = "probability")
qqnorm(w, main = "Normal QQ Plot Weight of Pop")
qqline(w, col = "gray")## [1] 10.66381
## [1] 29.59488
hist(a, breaks = seq(9,30,1), probability= TRUE, main = "Probability of Age of Pop", xlab = "Age", ylab = "probability")
qqnorm(a, main = "Normal QQ Plot Age of Pop")
qqline(a, col = "gray")## [1] 0
## [1] 11
hist(zk, breaks = seq(-1,12,1), probability= TRUE, main = "Probability of Zombie Kills", xlab = " of Zombie Kills", ylab = "probability")
qqnorm(zk, main = "Normal QQ Plot of Zombie Kills")
qqline(zk, col = "gray")This is a poisson distribution which makes sense because it is a similar scenario to the titi calls. How many kills are observed?
## [1] 0
## [1] 8
hist(ye, breaks = seq(0,9,1), probability= TRUE, main = "Probability of Years Educated", xlab = "Years", ylab = "probability")
qqnorm(ye, main = "Normal QQ Plot Years Educated")
qqline(ye, col = "gray")This is also a poisson distribution, for similar reasons as above.
Question 5
Now use the sample() function to sample ONE subset of 30 zombie survivors (without replacement) from this population and calculate the mean and sample standard deviation for each variable. Also estimate the standard error for each variable, and construct the 95% confidence interval for each mean. Note that for the variables that are not drawn from the normal distribution, you may need to base your estimate of the CIs on slightly different code than for the normal…
## [1] 66.88522
## [1] 142.2536
## [1] 19.68657
## [1] 3.066667
## [1] 2.966667
Can use built in function this time because it is a sample
## [1] 3.735037
## [1] 16.22658
## [1] 2.960476
## [1] 1.680175
## [1] 1.691425
The standard error can be calculated many different ways. Can divide each sd by 30, can create my own function, or I can use the sciplot built in function. I will load in sciplot and use that function. Standard error for the poisson I will do separetly since it just sqrt(lambda/n)
## [1] 0.6819214
## [1] 2.962555
## [1] 0.5405065
This sample is relatively close to the population. Most variables are at less than one standard error. Weight is almost 3 standard errors away, however.
upperH <- mh + qnorm(0.975, mean = 0, sd = 1) * se(h)
lowerH <- mh + qnorm(0.025, mean = 0, sd = 1) * se(h)
ciH <- c(lowerH, upperH)
ciH## [1] 65.54868 68.22176
upperW <- mw + qnorm(0.975, mean = 0, sd = 1) * se(w)
lowerW <- mw + qnorm(0.025, mean = 0, sd = 1) * se(w)
ciW <- c(lowerW, upperW)
ciW## [1] 136.4471 148.0601
upperA <- ma + qnorm(0.975, mean = 0, sd = 1) * se(a)
lowerA <- ma + qnorm(0.025, mean = 0, sd = 1) * se(a)
ciA <- c(lowerA, upperA)
ciA## [1] 18.62719 20.74594
upperZK <- mzk + qpois(0.975, lambda = 0) * seZK
lowerZK <- mzk + qpois(0.025, lambda = 0) * seZK
ciZK <- c(lowerZK, upperZK)
ciZK## [1] 3.066667 3.066667
#this obviously doesn't work since it returns the same thing for both upper and lower
exactPoiCI <- function (X, conf.level=0.95) {
alpha = 1 - conf.level
upper <- 0.5 * qchisq((1-(alpha/2)), (2*X)) #chi square necessary here
lower <- 0.5 * qchisq(alpha/2, (2*X +2))
return(c(lower, upper))
}
exactPoiCI(mzk, conf.level = 0.95) #0.95 is default but good to show what we are asking for## [1] 1.123747 7.330303
upperYE <- mean(ye) + qpois(0.975, lambda = 0) * seYE
lowerYE <- mean(ye) + qpois(0.025, lambda = 0) * seYE
ciYE <- c(lowerYE, upperYE)
ciYE## [1] 2.966667 2.966667
## [1] 1.073027 7.171705
Question 6
Now draw 99 more random samples of 30 zombie apocalypse survivors, and calculate the mean for each variable for each of these samples. Together with the first sample you drew, you now have a set of 100 means for each variable (each based on 30 observations), which constitutes a sampling distribution for each variable. What are the means and standard deviations of this distribution of means for each variable? How do the standard deviations of means compare to the standard errors estimated in [5]? What do these sampling distributions look like (a graph might help here)? Are they normally distributed? What about for those variables that you concluded were not originally drawn from a normal distribution?
Here’s what I tried first and it created a massive matrix with every cell containing a vector:
x <- replicate(99, slice_sample(d, n = 30, replace = FALSE)) #I refused to use a for loop here x <- t(x) #transposed so that the column names are the variables as.data.frame(x)
I blame slice_sample now!
Graveyard of attempted code to get means from my overly complicated matrix:
mHeight <- lapply(x[1:99, ], mean(x[, 5])) mWeight <- mean(x[6, 1:99]) mZombies_Killed <- mean(x[7, 1:99]) mYears_Ed <- mean(x[8, 1:99]) mAge <- mean(x[10, 1:99]) # heights <- x[1:99, 5] mHeights <- NULL for (i in heights) { mHeights[i] <- mean(x$height) }
Instead I am going to sample each variable independently (and that way I shouldn’t need to use slice_sample either) and take the means directly in that line of code.
set.seed(2)
mHeight <- replicate(99, mean(sample(d$height, 30, replace = F))) #taking a sample of heights, averaging, then repeating 99 more times
mWeight <- replicate(99, mean(sample(d$weight, 30, replace = F)))
mAge <- replicate(99, mean(sample(d$age, 30, replace = F)))
mZombies_Killed <- replicate(99, mean(sample(d$zombies_killed, 30, replace = F)))
mYears_Ed <- replicate(99, mean(sample(d$years_of_education, 30, replace = F)))Combine the first sample with the new ones:
mHeights <- c(mh, mHeight)
mWeights <- c(mw, mWeight)
mAges <- c(ma, mAge)
mKills <- c(mzk, mZombies_Killed)
mEducation <- c(mye, mYears_Ed)Means of Means, and SDs of Means
## [1] 67.653
## [1] 143.4503
## [1] 20.1015
## [1] 3.033333
## [1] 3.005333
## [1] 0.8021361
## [1] 3.178964
## [1] 0.5241518
## [1] 0.309556
## [1] 0.2536256
## [1] 0.6819214
## [1] 2.962555
## [1] 0.5405065
## [1] 0.3197221
## [1] 0.314466
Except for weights, there standard deviation values are pretty small suggesting that there is little variance. The standard errors from the single sampling in question 5 were pretty small as well (except for weights again) which suggests that they were fairly representative of the population. The standard deviations are pretty close to each of their respective standard errors which is indicative of one of the rules of the central limit theorem - that the standard deviation will be nearly equal to the standard error of the mean.
Below I have plotted each sampling distribution, with two lines: - the black line is the mean of the sampling distributions -the red is the mean of the population For the central limit theorem, the disitribution should be approximately centered on the population mean. I also included qqplots to show that they are normally distributed. I opted not go do par() because it made the plots too small to read.
hist(mKills, probability = T)
abline(v = mean(mKills))
abline(v= mean(d$zombies_killed), col = "red")hist(mEducation, probability = T)
abline(v = mean(mEducation))
abline(v= mean(d$years_of_education), col = "red")qqnorm(mEducation, main = "Normal QQ Plot for Sampled Mean Years Educated")
qqline(mEducation, col = "gray")They are all normally distributed, including years educated and zombies killed which were poisson distributed before. This demonstrates the central limit theorem!